Product Similarity
Similarity between products is an opportunity that both buyers and Mercado Libre can benefit from, provided it is identified properly. For a buyer, it simplifies the purchase decision by making identical products, or those with a high degree of similarity, directly comparable. For Mercado Libre, the information makes it possible, for example, to propose strategies for setting a price range on products known to be the same, and to flag listings that fall outside it as possible errors in the seller's listing.
To carry out this task, the challenge provides two datasets. The first, items_titles, contains more than 30,000 product titles from Mercado Libre Brazil. The second, items_titles_test, has about 10,000 records and is the one for which the deliverable must be generated: all of its product pairs, ordered by similarity score.
An efficient way to compare two text strings is through embeddings, which, in short, convert the relevant information in a text into numeric vectors. These vectors support mathematical operations such as cosine similarity, which measures how close two texts are semantically.
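As a minimal illustration of cosine similarity, the sketch below uses made-up 3-dimensional vectors in place of real embeddings (the names `phone_a`, `phone_b` and `blender` are hypothetical; real models produce hundreds of dimensions).

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings": two similar products and one unrelated product.
phone_a = [0.9, 0.1, 0.0]
phone_b = [0.8, 0.2, 0.1]
blender = [0.0, 0.2, 0.9]

sim_close = cosine_similarity(phone_a, phone_b)  # vectors point in nearly the same direction
sim_far = cosine_similarity(phone_a, blender)    # vectors are almost orthogonal
print(sim_close > sim_far)
```

Vectors that point in nearly the same direction score close to 1, while unrelated ones score close to 0.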
Training and building a model of this kind requires considerable effort and computing resources. Fortunately, large companies have already done that work and made the resulting models available for cases like this one. We will therefore try existing, validated models directly on the test set, which already amounts to about 50 million rows once all pairs are generated (10,000 × 9,999 / 2 = 49,995,000).
This approach saves effort and computing resources while still giving consistent results. Let's get started:
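The size of the pair table follows directly from the number of titles. The short check below assumes 10,000 titles, the size of items_titles_test reported in this notebook, and counts each unordered pair once; the memory figure is only a rough lower bound for the scores alone.

```python
import math

n_titles = 10_000                   # assumed size of items_titles_test
n_pairs = math.comb(n_titles, 2)    # unordered pairs, self-pairs excluded

# Rough memory for one float32 score per pair, ignoring the two title strings.
score_megabytes = n_pairs * 4 / 1e6
print(n_pairs, round(score_megabytes), "MB")
```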
!pip install itables
!pip install -U sentence-transformers
!pip3 install seaborn
!pip install absl-py
!pip install tensorflow
!pip install tensorflow_hub
from itables import init_notebook_mode
import pandas as pd
import numpy as np
from absl import logging
import tensorflow as tf
import tensorflow_hub as hub
import matplotlib.pyplot as plt
import os
import re
import seaborn as sns
init_notebook_mode(all_interactive=True)
# Load the larger dataset and take a quick look at it
items_titles = pd.read_csv("C:/Users/Garzonm1/Desktop/Teste Técnico - DS/items_titles.csv")
items_titles
# Load the dataset on which the models will be applied
items_titles_test = pd.read_csv("C:/Users/Garzonm1/Desktop/Teste Técnico - DS/items_titles_test.csv")
items_titles_test.size
10000
items_titles_test
Sentence transformers: Use BERT-based neural networks to embed sentences as vectors, which makes it efficient to identify similar titles by comparing their encoded semantic content.
Universal Sentence Encoder: Transforms sentences into dense vectors, using deep training on diverse data that captures linguistic nuance, which makes it well suited to identifying similar titles even when the wording varies subtly.
The model "paraphrase-multilingual-MiniLM-L12-v2" was chosen because it was the best-performing option that supports Portuguese, the language of the data shown above.
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')  # multilingual model
# Convert the titles to a list, then compute an embedding for each title
items = items_titles_test['ITE_ITEM_TITLE'].values.tolist()
embedding = model.encode(items, convert_to_tensor=False)
embedding.shape
(10000, 384)
cosine_scores = util.cos_sim(embedding, embedding)
# Map cosine similarity from the range [-1, 1] to [0, 1]
cosine_similarity_normalized = 0.5 * (cosine_scores + 1)
# Store the results
result_list = []
# Walk the list of items, comparing each pair of products exactly once
for i, v1 in enumerate(items):
    for j, v2 in enumerate(items):
        if i >= j:
            continue
        item1 = v1
        item2 = v2
        score = cosine_similarity_normalized[i][j].item()
        result_list.append((item1, item2, score))
# Build the results DataFrame
df = pd.DataFrame(result_list, columns=['item1', 'item2', 'score'])
# Sort by score, least similar pairs first
df_menores = df.sort_values(by='score', ascending=True).reset_index(drop=True)
df_menores.head()
# Build the DataFrame sorted from most to least similar
similar_items_test = df.sort_values(by='score', ascending=False).reset_index(drop=True)
similar_items_test.head(10)
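The pairwise Python loop used to build the pair list can be slow for tens of millions of pairs. One possible alternative, sketched here with NumPy only, reads the upper triangle of the similarity matrix in a single vectorized step. The titles and vectors below are hypothetical stand-ins for `items` and the model embeddings, not the notebook's data.

```python
import numpy as np

# Hypothetical stand-ins for the titles and their embeddings.
titles = ["item a", "item b", "item c", "item d"]
rng = np.random.default_rng(0)
vectors = rng.normal(size=(len(titles), 8))
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)  # unit-length rows

# Cosine similarity of unit vectors is their dot product; map [-1, 1] to [0, 1].
scores = 0.5 * (vectors @ vectors.T + 1.0)

# Indices of the strict upper triangle: every pair (i, j) with i < j, once.
rows, cols = np.triu_indices(len(titles), k=1)
pair_scores = scores[rows, cols]

# Rank the pairs from most to least similar.
order = np.argsort(-pair_scores)
ranked = [(titles[rows[p]], titles[cols[p]], float(pair_scores[p])) for p in order]
```

The arrays `rows`, `cols` and `pair_scores` could be passed to a DataFrame constructor directly, which avoids building a Python list of tuples first.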
# Export to CSV
module_url = "https://tfhub.dev/google/universal-sentence-encoder/4"
model = hub.load(module_url)
def embed(input):
    return model(input)
titles_list = items_titles_test['ITE_ITEM_TITLE'].tolist()
titles_list = titles_list[:3000]
# Reduce logging output.
logging.set_verbosity(logging.ERROR)
message_embeddings = embed(titles_list)
def compare_pairs(labels, features):
    corr = np.inner(features, features)
    comparisons = []
    for i in range(len(labels)):
        for j in range(i + 1, len(labels)):
            comparisons.append({
                'item1': labels[i],
                'item2': labels[j],
                'score': corr[i, j]
            })
    return pd.DataFrame(comparisons)
def run_and_compare(messages_):
    message_embeddings_ = embed(messages_)
    return compare_pairs(messages_, message_embeddings_)
df_comparisons = run_and_compare(titles_list)
df_comparisons = df_comparisons.sort_values(by='score', ascending=False).reset_index(drop=True)
df_comparisons
With this method, running all 10,000 records in the test set exceeded the available computing capacity. It was instead run on the first 3,000 records, which is enough to see that it gives results similar to those of the first model and that, at first glance, they make good sense.
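One possible way to stay within memory, if only the closest matches per title are needed rather than every pair, is to compute similarities in blocks and keep the top k for each row. The sketch below is an assumption about how this could be done with NumPy, not what the notebook ran; `top_k_neighbors`, `k` and `block_size` are hypothetical names, and the random matrix stands in for real embeddings.

```python
import numpy as np

def top_k_neighbors(embeddings, k=5, block_size=1000):
    """For each row, return indices and cosine scores of its k most similar other rows."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    n = normed.shape[0]
    best_idx = np.empty((n, k), dtype=np.int64)
    best_score = np.empty((n, k), dtype=normed.dtype)
    for start in range(0, n, block_size):
        stop = min(start + block_size, n)
        sims = normed[start:stop] @ normed.T  # (block, n) cosine similarities
        # Exclude each row's match with itself.
        sims[np.arange(stop - start), np.arange(start, stop)] = -np.inf
        # Unordered top k per row, then sort just those k columns.
        idx = np.argpartition(-sims, k - 1, axis=1)[:, :k]
        part = np.take_along_axis(sims, idx, axis=1)
        order = np.argsort(-part, axis=1)
        best_idx[start:stop] = np.take_along_axis(idx, order, axis=1)
        best_score[start:stop] = np.take_along_axis(part, order, axis=1)
    return best_idx, best_score

# Hypothetical data: rows 0 and 1 are made to point in the same direction.
rng = np.random.default_rng(0)
demo = rng.normal(size=(50, 16))
demo[1] = demo[0] * 2.0
idx, score = top_k_neighbors(demo, k=3, block_size=20)
```

Only one block of the similarity matrix is held in memory at a time, and the output grows with n × k instead of n².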
Although the models applied here are already trained and validated, to make them more specific to our use case we recommend finding a way to validate the results (for example, if they are shown in the marketplace, a thumbs-up button to confirm that the product resembles what the buyer is looking for) and using that feedback to fine-tune the model so that it fits the products on Mercado Libre.
To improve computational performance, implement techniques such as KD-trees, which make search more efficient by reducing the search space in recommender systems: each product is compared only with the closest ones in its partition rather than with millions of products.
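A minimal sketch of the KD-tree idea, assuming SciPy is available and using random vectors as hypothetical embeddings. For unit-length vectors, Euclidean distance and cosine similarity rank neighbors identically, so the tree can be queried with its default metric. Note that KD-trees tend to lose their speed advantage as dimensionality grows, so with embeddings of several hundred dimensions this should be benchmarked before being relied on.

```python
import numpy as np
from scipy.spatial import cKDTree

# Hypothetical embeddings, scaled to unit length.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 16))
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)

tree = cKDTree(embeddings)
# Ask for k=4 because the nearest neighbor of each point is the point itself.
dist, idx = tree.query(embeddings, k=4)

# For unit vectors, d^2 = 2 - 2*cos, so cos = 1 - d^2 / 2.
cosine = 1.0 - dist**2 / 2.0
neighbors = idx[:, 1:]         # drop the self-match in column 0
neighbor_cosine = cosine[:, 1:]
```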
Complement the product titles with additional information, such as the categories that were mentioned or the images used in the listing, and combine them using models that handle both image and text. One example is clip-ViT-B-32-multilingual-v1, which works for multiple languages, including Portuguese, and can identify relationships between images and text, producing more robust results.